Implement sharding and device mesh debug tool #2328
Conversation
**Why this PR**

Understanding different DTensor shardings is important, especially when working on a full-DTensor project. This tool is created for debugging purposes.

**What this PR does**

This PR adds a sharding debug tool that captures and visualizes DTensor sharding information during training. When enabled via `debug.log_sharding_info=True`, it registers forward and backward hooks on all modules to record tensor placements, device mesh info, and shapes for one forward/backward pass. The tool outputs both a formatted ASCII text file and an interactive HTML visualization.

**Limitation:** This tool can only track 1) module inputs/outputs and their gradients, and 2) module states and their gradients. Activations generated by ops that are not modules cannot be tracked; that would require TorchFunctionMode or TorchDispatchMode.

**For Reviewers:** The UX functions (ASCII and HTML) are completely generated by Claude. I'm not an experienced frontend developer and didn't closely code-review those files.

```
NGPU=8 COMM_MODE=fake_backend CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --parallelism.tensor_parallel_degree=8 --debug.log_sharding_info
```

ghstack-source-id: 5354c4e
Pull-Request: #2328
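The hook-based capture described above can be sketched as a minimal, pure-Python mock with no torch dependency. `TensorInfo`, `ModuleRecord`, and `ShardingRecorder` are illustrative names for this sketch, not the tool's actual classes; the real tool would fill these records from inside `register_forward_hook`/`register_full_backward_hook` callbacks:

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for DTensor metadata (placements/mesh are shown
# as plain strings here; the real tool would read them off a DTensor).
@dataclass
class TensorInfo:
    shape: tuple
    placements: tuple  # e.g. ("Shard(0)",) or ("Replicate()",)
    mesh: str          # e.g. "DeviceMesh((8,), ('tp',))"

@dataclass
class ModuleRecord:
    fqn: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

class ShardingRecorder:
    """Collects one record per module FQN for a single forward pass."""

    def __init__(self):
        self.records = {}

    def record(self, fqn, inputs, outputs):
        # Called from a (hypothetical) forward hook with the module's FQN.
        rec = self.records.setdefault(fqn, ModuleRecord(fqn))
        rec.inputs.extend(inputs)
        rec.outputs.extend(outputs)

    def to_ascii(self):
        # A bare-bones version of the formatted ASCII output.
        lines = []
        for rec in self.records.values():
            lines.append(f"module {rec.fqn}")
            for t in rec.inputs:
                lines.append(f"  in : shape={t.shape} placements={t.placements}")
            for t in rec.outputs:
                lines.append(f"  out: shape={t.shape} placements={t.placements}")
        return "\n".join(lines)

recorder = ShardingRecorder()
recorder.record(
    "layers.0.attention.wq",
    [TensorInfo((8, 2048, 4096), ("Replicate()",), "mesh(tp=8)")],
    [TensorInfo((8, 2048, 512), ("Shard(2)",), "mesh(tp=8)")],
)
print(recorder.to_ascii())
```

The key design point is that the recorder stores plain data keyed by module FQN, so renderers (ASCII, HTML) only need to consume `records`.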
wwwjn
left a comment
This is generally useful for all DTensor developers, not only titan developers (and we would expect infra developers to be interested in this tool, rather than model researchers). Should we put it in PyTorch, like flight_recorder?
Most of our users also care about scaling, so this is useful for them too.
The reason I haven't put it in PyTorch is that this tool is not general enough yet. It will likely be polished continuously once we start using it. After it is mature enough, we can upstream it to PyTorch.
wconstab
left a comment
I'm wondering about the best way to land this. It seems nice, but it's also a big pile of code.
- it's not obvious that it is torchtitan-specific. Should some or all of this go into torch itself?
Is there a nice way to decouple a 'core' piece that lands in torch? Perhaps:
- a context manager that itself produces a well-defined data structure
- a clearly separate plugin that takes the data and renders it
This way, you can keep the HTML stuff out of tree.
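The decoupling suggested above could look something like the following sketch: a core context manager that yields a plain, renderer-agnostic data structure, and a separate renderer that consumes it. All names here (`capture_sharding`, `render_ascii`) are illustrative, not an existing API:

```python
from contextlib import contextmanager

@contextmanager
def capture_sharding():
    # Core piece that could live in torch: yields a well-defined data
    # structure. The real implementation would install forward/backward
    # hooks here and remove them on exit; the sketch just yields a list.
    records = []
    try:
        yield records
    finally:
        pass  # hook removal would go here

def render_ascii(records):
    """Out-of-tree plugin: turns records into text. An HTML renderer
    could be a sibling plugin honoring the same input contract."""
    return "\n".join(f"{fqn}: {placement}" for fqn, placement in records)

with capture_sharding() as records:
    # Stand-in for what a hook would append during a forward pass.
    records.append(("tok_embeddings.weight", "Shard(0)"))

print(render_ascii(records))
```

With this split, only the data contract needs to be stable; visualization code can iterate independently outside the torch tree.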
I think it partially addresses the pain point of sharding info not being an integral part of nn.Module, in the sense that we explicitly insert hooks to record what happens, somewhat similar to
Can we merge/iterate with one of those tools in PyTorch?
@tianyu-l cc @wconstab @wwwjn The same answer applies to your questions as well.
I think we could put it in a separate branch for now. Also, using HTML might be too burdensome since most of the info is structured; can we simply put it in a JSON file? That would also simplify the landing process.
@wwwjn Let's put it in another branch for now. But I'm not sure I agree on the HTML part. HTML is more readable; even though this PR also generates an ASCII file, I merely use that for programmatic verification, and mostly I read the HTML. tlparse also uses HTML. Let's iterate in another branch to understand the best way to upstream the tool.
@fegin I'm unsure how much work it would be to integrate into CommDebugMode. As you can see, CommDebugMode already includes some of the information you're trying to output. Furthermore, CommDebugMode uses noise levels to control how much information is output. In this case, you could either create a separate output function just for your output, or make your work the minimum noise level, where no ops are shown and it's just module sharding information. In addition, there technically was an HTML
@anshul-si Yes, I saw that before :) But the main information I need is simply the input, state, and output sharding of a module, and I only see state sharding there. I guess one could try to infer it from the aten op order, but it is better to explicitly capture this information. I think one way forward is to enhance CommDebugMode to capture


Stack from ghstack (oldest at bottom):
Why this PR
Understanding different DTensor sharding is import especially when doing full dtensor project. Creating this tool for debugging purpose.
What this PR does
This PR adds a sharding debug tool that captures and visualizes DTensor sharding information during training. When enabled via
debug.log_sharding_info=True, it registers forward and backward hooks on all modules to record tensor placements, device mesh info, and shapes for one forward/backward pass. The tool outputs both a formatted ASCII text file and an interactive HTML visualization.Limitation:
This tool can only track 1) module inputs/outputs and the gradients and 2) module states and the gradients. Any activation that generate by ops that is not a module can not be tracked. We will have to use TorchFunctionMode or TorchDispatchMode to do this.
For Reviewers:
UX functions (ASCII and html) are completely generated by Claude. I'm not an experienced frontend developer and didn't code review too much html file.